Final Project
This course can be found in udacity ud170.

Data Analysis Process

Setting Up Your System

Otherwise, you can find the free course here.

Intro to CSVs

If you’d like to learn more about data wrangling, check out the Udacity course Data Wrangling with MongoDB.

CSVs in Python

https://s3.cn-north-1.amazonaws.com.cn/u-vid-hd/22sQCo6ovH0.mp4

import unicodecsv
enrollments_filename = '/datasets/ud170/udacity-students/enrollments.csv'
## Longer version of code (replaced with shorter, equivalent version below)
# enrollments = []
# f = open(enrollments_filename, 'rb')
# reader = unicodecsv.DictReader(f)
# for row in reader:
#     enrollments.append(row)
# f.close()
with open(enrollments_filename, 'rb') as f:
    reader = unicodecsv.DictReader(f)
    enrollments = list(reader)
    
### Write code similar to the above to load the engagement
### and submission data. The data is stored in files with
### the given filenames. Then print the first row of each
### table to make sure that your code works. You can use the
### "Test Run" button to see the output of your code.
engagement_filename = '/datasets/ud170/udacity-students/daily_engagement.csv'
submissions_filename = '/datasets/ud170/udacity-students/project_submissions.csv'
    
daily_engagement = None     # Replace this with your code
project_submissions = None  # Replace this with your code

with open(engagement_filename, 'rb') as f:
    reader = unicodecsv.DictReader(f)
    daily_engagement = list(reader)    
print daily_engagement[0]
with open(submissions_filename, 'rb') as f:
    reader = unicodecsv.DictReader(f)
    project_submissions = list(reader)
print project_submissions[0]

Python’s csv Module

This page contains documentation for Python’s csv module. Instead of csv, you’ll be using unicodecsv in this course. unicodecsv works exactly the same as csv, but it comes with Anaconda and has support for unicode. The csv documentation page is still the best way to learn how to use the unicodecsv library, since the two libraries work exactly the same way.

Iterators in Python

This page explains the difference between iterators and lists in Python, and how to use iterators.

Solutions

DAND students click here for solution code

IPND students: Look at the end of this lesson for Quiz Solutions

Fixing Data Types

https://s3.cn-north-1.amazonaws.com.cn/u-vid-hd/7NSYtdVrlRE.mp4

Questions about Student Data

https://s3.cn-north-1.amazonaws.com.cn/u-vid-hd/AO8vSyAtfV4.mp4

Investigating the Data

Now you’ve started the data wrangling process by loading the data and making sure it’s in a good format. The next step is to investigate a bit and see if there are any inconsistencies or problems in the data that you’ll need to clean up.

For each of the three files you’ve loaded, find the total number of rows in the csv and the number of unique students. To find the number of unique students in each table, you might want to try creating a set of the account keys.

Again, in case you’re not finished with your local setup, you can complete this exercise in the Udacity code editor. You’ll need to run the next exercise locally, though, so if you haven’t finished setting up, you should do that now.

import unicodecsv
def read_csv(filename):
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)
enrollments = read_csv('/datasets/ud170/udacity-students/enrollments.csv')
daily_engagement = read_csv('/datasets/ud170/udacity-students/daily_engagement.csv')
project_submissions = read_csv('/datasets/ud170/udacity-students/project_submissions.csv')
    
### For each of these three tables, find the number of rows in the table and
### the number of unique students in the table. To find the number of unique
### students, you might want to create a set of the account keys in each table.
enrollment_num_rows = 0             # Replace this with your code
enrollment_num_unique_students = 0  # Replace this with your code
engagement_num_rows = 0             # Replace this with your code
engagement_num_unique_students = 0  # Replace this with your code
submission_num_rows = 0             # Replace this with your code
submission_num_unique_students = 0  # Replace this with your code

def unique_num(data):
    unique_data = set()
    for element in data:
        if 'acct' in element:
            element['account_key'] = element['acct']
            del element['acct']
        unique_data.add(element['account_key'])
    return len(unique_data)
print enrollments[0]
enrollment_num_rows = len(enrollments)             # Replace this with your code
enrollment_num_unique_students = unique_num(enrollments)  # Replace this with your code
print enrollment_num_rows
print enrollment_num_unique_students
    
print daily_engagement[0]
engagement_num_rows = len(daily_engagement)             # Replace this with your code
print engagement_num_rows
engagement_num_unique_students = unique_num(daily_engagement)  # Replace this with your code
print engagement_num_unique_students
print project_submissions[0]
submission_num_rows = len(project_submissions)             # Replace this with your code
submission_num_unique_students = unique_num(project_submissions)  # Replace this with your code
print submission_num_rows
print submission_num_unique_students

Problems in the Data

Removing an Element from a Dictionary

If you’re not sure how to remove an element from a dictionary, this post might be helpful.

Solutions

DAND students click here for solution code

IPND students: Look at the end of this lesson for Quiz Solutions

Updated Code for Previous Exercise

After running the above code, Caroline also shows rewriting the solution from the previous exercise to the following code:

def get_unique_students(data):
    unique_students = set()
    for data_point in data:
        unique_students.add(data_point['account_key'])
    return unique_students
len(enrollments)
unique_enrolled_students = get_unique_students(enrollments)
len(unique_enrolled_students)
len(daily_engagement)
unique_engagement_students = get_unique_students(daily_engagement)
len(unique_engagement_students)
len(project_submissions)
unique_project_submitters = get_unique_students(project_submissions)
len(unique_project_submitters)

Missing Engagement Records

Printing a Single Row
This page describes how to use Python’s break statement, which might be helpful for printing only a single problem record.

Solutions
DAND students click here for solution code

IPND students: Look at the end of this lesson for Quiz Solutions

Checking for More Problem Records

Tracking Down the Remaining Problems

Refining the Question

Exploratory Data Analysis

If you’d like to learn more about the exploratory phase of the data analysis process, check out the Udacity course Data Analysis with R.

Solutions

DAND students click here for solution code

IPND students: Look at the end of this lesson for Quiz Solutions

Getting Data from First Week

https://s3.cn-north-1.amazonaws.com.cn/u-vid-hd/adqc5fF5B8Y.mp4
https://classroom.udacity.com/courses/ud170/lessons/5430778793/concepts/53961386350923

Note that paid students may have canceled from other courses before paying, and the suggested solution will retain records from these other enrollments.

Indulge Curiosity

Exploring Student Engagement

Debugging Data Analysis Code

Lessons Completed in First Week

Number of Visits in the First Week

https://s3.cn-north-1.amazonaws.com.cn/u-vid-hd/5GYA5j1fqBU.mp4
https://classroom.udacity.com/courses/ud170/lessons/5430778793/concepts/53961386450923

Splitting out Passing Students

Quiz: Comparing the Two Student Groups

Quiz: Making Histograms

Visualizing data

Even though you know the mean, standard deviation, maximum, and minimum of various metrics, there are a lot of other facts about each metric that would be nice to know. Are more values close to the minimum or the maximum? What is the median? And so on.

Instead of printing out more statistics, at this point it makes sense to visualize the data using a histogram.

Making histograms in Python

To make a histogram in Python, you can use the matplotlib library, which comes with Anaconda. The following code will make a histogram of an example list of data points called data.

data = [1, 2, 1, 3, 3, 1, 4, 2]
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(data)

The line %matplotlib inline is specifically for IPython notebook, and causes your plots to appear in your notebook rather than a new window. If you are not using IPython notebook, you should not include this line, and instead you should add the line plt.show() at the bottom to show the plot in a new window.

Making histograms of student data

Now use this method to make a histogram of each of the three metrics we looked at for both students who pass the subway project and students who don’t. That is, you should create 6 histograms. Do any of the metrics have histograms with very different shapes for students who pass the subway project vs. those who don’t?

You can also create histograms of the metrics you explored on your own if you’d like.

Are your Results Just Noise?

Statistics

If you’d like to learn more about statistics, which you can use to rigorously determine how likely it is that your results are due to chance, check out the Udacity courses Intro to Descriptive Statistics and Intro to Inferential Statistics.

Correlation Does Not Imply Causation

Cheese and Bedsheet Tangling

To see the plot shown in the video, as well as many other amusing or strange correlations, check out this website.

A/B Testing

To learn more about using online experiments to determine whether one change causes another, take the Udacity course A/B Testing.

Predicting Based on Many Features

Machine Learning

To learn more about using machine learning to automatically make predictions, take the Udacity course Intro to Machine Learning.

Communication

Adding labels and titles

In matplotlib, you can add axis labels using plt.xlabel("Label for x axis") and plt.ylabel("Label for y axis"). For histograms, you usually only need an x-axis label, but for other plot types a y-axis label may also be needed. You can also add a title using plt.title("Title of plot").

Making plots look nicer with seaborn

You can automatically make matplotlib plots look nicer using the seaborn library. This library is not automatically included with Anaconda, but Anaconda includes something called a package manager to make it easier to add new libraries. The package manager is called conda, and to use it, you should open the Command Prompt (on a PC) or terminal (on Mac or Linux), and type the command conda install seaborn.

If you are using a different Python installation than Anaconda, you may have a different package manager. The most common ones are pip and easy_install, and you can use them with the commands pip install seaborn or easy_install seaborn respectively.

Once you have installed seaborn, you can import it anywhere in your code using the line import seaborn as sns. Then any plot you make afterwards will automatically look better. Give it a try!

If you’re wondering why the abbreviation for seaborn is sns, it’s because seaborn was named after the character Samuel Norman Seaborn from the show The West Wing, and sns are his initials.

The seaborn package also includes some extra functions you can use to make complex plots that would be difficult in matplotlib. We won’t be covering those in this course, but if you’d like to see what functions seaborn has available, you can look through the documentation.

Adding extra arguments to your plot

You’ll also frequently want to add some arguments to your plot to tune how it looks. You can see what arguments are available on the documentation page for the hist function. One common argument to pass is the bins argument, which sets the number of bins used by your histogram. For example, plt.hist(data, bins=20) would make sure your histogram has 20 bins.

Improving one of your plots

Use these techniques to improve at least one of the plots you made earlier.

Finally, decide which of the discoveries you made this lesson you would most want to communicate to someone else, and write a forum post sharing your findings.

Conclusion

L1_Solution_Code.ipynb

Quiz Solutions

CSVs in Python

import unicodecsv
def read_csv(filename):
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)
enrollments = read_csv('enrollments.csv')
daily_engagement = read_csv('daily_engagement.csv')
project_submissions = read_csv('project_submissions.csv')

Investigating the Data

len(enrollments)
unique_enrolled_students = set()
for enrollment in enrollments:
    unique_enrolled_students.add(enrollment['account_key'])
len(unique_enrolled_students)
len(daily_engagement)
unique_engagement_students = set()
for engagement_record in daily_engagement:
    unique_engagement_students.add(engagement_record['acct'])
len(unique_engagement_students)
len(project_submissions)
unique_project_submitters = set()
for submission in project_submissions:
    unique_project_submitters.add(submission['account_key'])
len(unique_project_submitters)

Problems in the Data

1
2
3

for engagement_record in daily_engagement:
    engagement_record['account_key'] = engagement_record['acct']
    del[engagement_record['acct']]

Missing engagement records

for enrollment in enrollments:
    student = enrollment['account_key']
    if student not in unique_engagement_students:
        print enrollment
        break

Checking for more problem records

num_problem_students = 0
for enrollment in enrollments:
    student = enrollment['account_key']
    if (student not in unique_engagement_students and 
            enrollment['join_date'] != enrollment['cancel_date']):
        print enrollment
        num_problem_students += 1

num_problem_students

Refining the Question

paid_students = {}
for enrollment in non_udacity_enrollments:
    if (not enrollment['is_canceled'] or
            enrollment['days_to_cancel'] > 7):
        account_key = enrollment['account_key']
        enrollment_date = enrollment['join_date']
        if (account_key not in paid_students or
                enrollment_date > paid_students[account_key]):
            paid_students[account_key] = enrollment_date
len(paid_students)

Note that if you switch the order of the second if statement like so

if (enrollment_date > paid_students[account_key] or
account_key not in paid_students)
you will most likely get an error. Why do you think that is? Check out this Stackoverflow discussion to find out more: http://stackoverflow.com/questions/13960657/does-python-evaluate-ifs-conditions-lazily

Getting Data from First Week

def within_one_week(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return time_delta.days < 7
def remove_free_trial_cancels(data):
    new_data = []
    for data_point in data:
        if data_point['account_key'] in paid_students:
            new_data.append(data_point)
    return new_data
paid_enrollments = remove_free_trial_cancels(non_udacity_enrollments)
paid_engagement = remove_free_trial_cancels(non_udacity_engagement)
paid_submissions = remove_free_trial_cancels(non_udacity_submissions)
print len(paid_enrollments)
print len(paid_engagement)
print len(paid_submissions)
paid_engagement_in_first_week = []
for engagement_record in paid_engagement:
    account_key = engagement_record['account_key']
    join_date = paid_students[account_key]
    engagement_record_date = engagement_record['utc_date']
    if within_one_week(join_date, engagement_record_date):
         paid_engagement_in_first_week.append(engagement_record)
len(paid_engagement_in_first_week)

Debugging Data Analysis Code

Here is the code Caroline shows in the solution video:

student_with_max_minutes = None
max_minutes = 0
for student, total_minutes in total_minutes_by_account.items():
    if total_minutes > max_minutes:
        max_minutes = total_minutes
        student_with_max_minutes = student
max_minutes
for engagement_record in paid_engagement_in_first_week:
    if engagement_record['account_key'] == student_with_max_minutes:
        print engagement_record

Alternatively, you can find the account key with the maximum minutes using this shorthand notation:

1	max(total_minutes_by_account.items(), key=lambda pair: pair[1])

Fixing Bug in within_one_week()

She also updated the code for the within_one_week function to the following:

1
2
3

def within_one_week(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return time_delta.days >= 0 and time_delta.days < 7

Lessons Completed in First Week

First, Caroline refactors the given code to analyze total minutes spent in the first week into the following:

from collections import defaultdict
def group_data(data, key_name):
    grouped_data = defaultdict(list)
    for data_point in data:
        key = data_point[key_name]
        grouped_data[key].append(data_point)
    return grouped_data
engagement_by_account = group_data(paid_engagement_in_first_week,
                                   'account_key')
def sum_grouped_items(grouped_data, field_name):
    summed_data = {}
    for key, data_points in grouped_data.items():
        total = 0
        for data_point in data_points:
            total += data_point[field_name]
        summed_data[key] = total
    return summed_data
total_minutes_by_account = sum_grouped_items(engagement_by_account,
                                             'total_minutes_visited')
import numpy as np
def describe_data(data):
    print 'Mean:', np.mean(data)
    print 'Standard deviation:', np.std(data)
    print 'Minimum:', np.min(data)
    print 'Maximum:', np.max(data)
describe_data(total_minutes_by_account.values())

Then she called the functions she created to analyze the lessons completed in the first week as follows:

1
2
3

lessons_completed_by_account = sum_grouped_items(engagement_by_account,
                                                 'lessons_completed')
describe_data(lessons_completed_by_account.values())

Number of Visits in the First Week

Here is the code Caroline shows in the solution video. First she ran this code to create the has_visited field:

for engagement_record in paid_engagement:
    if engagement_record['num_courses_visited'] > 0:
        engagement_record['has_visited'] = 1
    else:
        engagement_record['has_visited'] = 0

Then, after recreating the engagement_by_account dictionary with the updated data, she ran the following code to analyze days visited in the first week:

1
2
3

days_visited_by_account = sum_grouped_items(engagement_by_account,
                                            'has_visited')
describe_data(days_visited_by_account.values())

Splitting out Passing Students

Here is the code Caroline shows in the solution video:

subway_project_lesson_keys = ['746169184', '3176718735']
pass_subway_project = set()
for submission in paid_submissions:
    project = submission['lesson_key']
    rating = submission['assigned_rating']    
    if ((project in subway_project_lesson_keys) and
            (rating == 'PASSED' or rating == 'DISTINCTION')):
        pass_subway_project.add(submission['account_key'])
len(pass_subway_project)
passing_engagement = []
non_passing_engagement = []
for engagement_record in paid_engagement_in_first_week:
    if engagement_record['account_key'] in pass_subway_project:
        passing_engagement.append(engagement_record)
    else:
        non_passing_engagement.append(engagement_record)
print len(passing_engagement)
print len(non_passing_engagement)

Comparing the Two Student Groups

Here is the code Caroline shows in the solution video:

passing_engagement_by_account = group_data(passing_engagement,
                                           'account_key')
non_passing_engagement_by_account = group_data(non_passing_engagement,
                                               'account_key')
print 'non-passing students:'
non_passing_minutes = sum_grouped_items(
    non_passing_engagement_by_account,
    'total_minutes_visited'
)
describe_data(non_passing_minutes.values())
print 'passing students:'
passing_minutes = sum_grouped_items(
    passing_engagement_by_account,
    'total_minutes_visited'
)
describe_data(passing_minutes.values())
print 'non-passing students:'
non_passing_lessons = sum_grouped_items(
    non_passing_engagement_by_account,
    'lessons_completed'
)
describe_data(non_passing_lessons.values())
print 'passing students:'
passing_lessons = sum_grouped_items(
    passing_engagement_by_account,
    'lessons_completed'
)
describe_data(passing_lessons.values())
print 'non-passing students:'
non_passing_visits = sum_grouped_items(
    non_passing_engagement_by_account, 
    'has_visited'
)
describe_data(non_passing_visits.values())
print 'passing students:'
passing_visits = sum_grouped_items(
    passing_engagement_by_account,
    'has_visited'
)
describe_data(passing_visits.values())

Making Histograms

Here is the code Caroline shows in the solution video:

%pylab inline
import matplotlib.pyplot as plt
import numpy as np
# Summarize the given data
def describe_data(data):
    print 'Mean:', np.mean(data)
    print 'Standard deviation:', np.std(data)
    print 'Minimum:', np.min(data)
    print 'Maximum:', np.max(data)
    plt.hist(data)

Fixing the Number of Bins

To change how many bins are shown for each plot, try using the bins argument to the hist function. You can find documentation for the hist function and the arguments it takes here.

Here is the code Caroline shows in the solution video:

import seaborn as sns
plt.hist(non_passing_visits.values(), bins=8)
plt.xlabel('Number of days')
plt.title('Distribution of classroom visits in the first week ' + 
          'for students who do not pass the subway project')
plt.hist(passing_visits.values(), bins=8)
plt.xlabel('Number of days')
plt.title('Distribution of classroom visits in the first week ' + 
          'for students who pass the subway project')

Quiz: Survey Says!

Numpy and Pandas for 1D Data

Introduction

Quiz: Gapminder Data

Gapminder data
The data in this lesson was obtained from the site gapminder.org. The variables included are:

Aged 15+ Employment Rate (%)
Life Expectancy (years)
GDP/capita (US$, inflation adjusted)
Primary school completion (% of boys)
Primary school completion (% of girls)
You can also obtain the data to anlayze on your own from the Downloadables section.

One-Dimensional Data in NumPy and Pandas

Quiz: NumPy Arrays

Pandas	Numpy
Series	Array

similarity and difference between numpy array and python list

similarity	difference
for loop	numpy array have the same type

import numpy as np
# First 20 countries with employment data
countries = np.array([
    'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
    'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas',
    'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
    'Belize', 'Benin', 'Bhutan', 'Bolivia',
    'Bosnia and Herzegovina'
])
# Employment data in 2007 for those 20 countries
employment = np.array([
    55.70000076,  51.40000153,  50.5       ,  75.69999695,
    58.40000153,  40.09999847,  61.5       ,  57.09999847,
    60.90000153,  66.59999847,  60.40000153,  68.09999847,
    66.90000153,  53.40000153,  48.59999847,  56.79999924,
    71.59999847,  58.40000153,  70.40000153,  41.20000076
])
# Change False to True for each block of code to see what it does
# Accessing elements
if False:
    print countries[0]
    print countries[3]
# Slicing
if False:
    print countries[0:3]
    print countries[:3]
    print countries[17:]
    print countries[:]
# Element types
if False:
    print countries.dtype
    print employment.dtype
    print np.array([0, 1, 2, 3]).dtype
    print np.array([1.0, 1.5, 2.0, 2.5]).dtype
    print np.array([True, False, True]).dtype
    print np.array(['AL', 'AK', 'AZ', 'AR', 'CA']).dtype
# Looping
if False:
    for country in countries:
        print 'Examining country {}'.format(country)
    for i in range(len(countries)):
        country = countries[i]
        country_employment = employment[i]
        print 'Country {} has employment {}'.format(country,
                country_employment)
# Numpy functions
if False:
    print employment.mean()
    print employment.std()
    print employment.max()
    print employment.sum()
def max_employment(countries, employment):
    '''
    Fill in this function to return the name of the country
    with the highest employment in the given employment
    data, and the employment in that country.
    '''
    max_country = None      # Replace this with your code
    max_value = None   # Replace this with your code
    return (max_country, max_value)

solution

def max_employment(countries, employment):
    '''
    Fill in this function to return the name of the country
    with the highest employment in the given employment
    data, and the employment in that country.
    '''
    max_country = countries[employment.argmax()]      # Replace this with your code
    max_value = employment.max()   # Replace this with your code
    return (max_country, max_value)

argmax() return the position of max()

Quiz: Vectorized Operations

+ operation:
python | numpy
—|—
list concatenation | vector addition

Quiz: Multiplying by a Scalar

Quiz: Calculate Overall Completion Rate

Bitwise Operations
See this article for more information about bitwise operations.

In NumPy, a & b performs a bitwise and of a and b. This is not necessarily the same as a logical and, if you wanted to see if matching terms in two integer vectors were non-zero. However, if a and b are both arrays of booleans, rather than integers, bitwise and and logical and are the same thing. If you want to perform a logical and on integer vectors, then you can use the NumPy function np.logical_and(a, b) or convert them into boolean vectors first.

Similarly, a | b performs a bitwise or, and ~a performs a bitwise not. However, if your arrays contain booleans, these will be the same as performing logical or and logical not. NumPy also has similar functions for performing these logical operations on integer-valued arrays.

For the quiz, assume that the number of males and females are equal i.e. we can take a simple average to get an overall completion rate.

In the solution, we may want to / 2. instead of just / 2. This is because in Python 2, dividing an integer by another integer (2) drops fractions, so if our inputs are also integers, we may end up losing information. If we divide by a float (2.) then we will definitely retain decimal values.

Erratum: The output of cell [3] in the solution video is incorrect: it appears that the male variable has not been set to the proper value set in cell [2]. All values except for the first will be different. The correct output in cell Out[3]: should instead start with:

array([ 192.83205, 205.28855, 202.82258, 186.63257, 206.91115,

import numpy as np
# Change False to True for each block of code to see what it does
# Arithmetic operations between 2 NumPy arrays
if False:
    a = np.array([1, 2, 3, 4])
    b = np.array([1, 2, 1, 2])
    
    print a + b
    print a - b
    print a * b
    print a / b
    print a ** b
    
# Arithmetic operations between a NumPy array and a single number
if False:
    a = np.array([1, 2, 3, 4])
    b = 2
    
    print a + b
    print a - b
    print a * b
    print a / b
    print a ** b
    
# Logical operations with NumPy arrays
if False:
    a = np.array([True, True, False, False])
    b = np.array([True, False, True, False])
    
    print a & b
    print a | b
    print ~a
    
    print a & True
    print a & False
    
    print a | True
    print a | False
    
# Comparison operations between 2 NumPy Arrays
if False:
    a = np.array([1, 2, 3, 4, 5])
    b = np.array([5, 4, 3, 2, 1])
    
    print a > b
    print a >= b
    print a < b
    print a <= b
    print a == b
    print a != b
    
# Comparison operations between a NumPy array and a single number
if False:
    a = np.array([1, 2, 3, 4])
    b = 2
    
    print a > b
    print a >= b
    print a < b
    print a <= b
    print a == b
    print a != b
    
# First 20 countries with school completion data
countries = np.array([
       'Algeria', 'Argentina', 'Armenia', 'Aruba', 'Austria','Azerbaijan',
       'Bahamas', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Bolivia',
       'Botswana', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi',
       'Cambodia', 'Cameroon', 'Cape Verde'
])
# Female school completion rate in 2007 for those 20 countries
female_completion = np.array([
    97.35583,  104.62379,  103.02998,   95.14321,  103.69019,
    98.49185,  100.88828,   95.43974,   92.11484,   91.54804,
    95.98029,   98.22902,   96.12179,  119.28105,   97.84627,
    29.07386,   38.41644,   90.70509,   51.7478 ,   95.45072
])
# Male school completion rate in 2007 for those 20 countries
male_completion = np.array([
     95.47622,  100.66476,   99.7926 ,   91.48936,  103.22096,
     97.80458,  103.81398,   88.11736,   93.55611,   87.76347,
    102.45714,   98.73953,   92.22388,  115.3892 ,   98.70502,
     37.00692,   45.39401,   91.22084,   62.42028,   90.66958
])
def overall_completion_rate(female_completion, male_completion):
    '''
    Fill in this function to return a NumPy array containing the overall
    school completion rate for each country. The arguments are NumPy
    arrays giving the female and male completion of each country in
    the same order.
    '''
    return None

solution

def overall_completion_rate(female_completion, male_completion):
    '''
    Fill in this function to return a NumPy array containing the overall
    school completion rate for each country. The arguments are NumPy
    arrays giving the female and male completion of each country in
    the same order.
    '''
    return (female_completion + male_completion ) /2.

Quiz: Standardizing Data

quiz

import numpy as np
# First 20 countries with employment data
countries = np.array([
    'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
    'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas',
    'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
    'Belize', 'Benin', 'Bhutan', 'Bolivia',
    'Bosnia and Herzegovina'
])
# Employment data in 2007 for those 20 countries
employment = np.array([
    55.70000076,  51.40000153,  50.5       ,  75.69999695,
    58.40000153,  40.09999847,  61.5       ,  57.09999847,
    60.90000153,  66.59999847,  60.40000153,  68.09999847,
    66.90000153,  53.40000153,  48.59999847,  56.79999924,
    71.59999847,  58.40000153,  70.40000153,  41.20000076
])
# Change this country name to change what country will be printed when you
# click "Test Run". Your function will be called to determine the standardized
# score for this country for each of the given 5 Gapminder variables in 2007.
# The possible country names are available in the Downloadables section.
country_name = 'United States'
def standardize_data(values):
    '''
    Fill in this function to return a standardized version of the given values,
    which will be in a NumPy array. Each value should be translated into the
    number of standard deviations that value is away from the mean of the data.
    (A positive number indicates a value higher than the mean, and a negative
    number indicates a value lower than the mean.)
    '''
    return None

solution

1 2	def standardize_data(values): return (values - values.mean()) / values.std()

Quiz: NumPy Index Arrays

quiz

import numpy as np
# Change False to True for each block of code to see what it does
# Using index arrays
if False:
    a = np.array([1, 2, 3, 4])
    b = np.array([True, True, False, False])
    
    print a[b]
    print a[np.array([True, False, True, False])]
    
# Creating the index array using vectorized operations
if False:
    a = np.array([1, 2, 3, 2, 1])
    b = (a >= 2)
    
    print a[b]
    print a[a >= 2]
    
# Creating the index array using vectorized operations on another array
if False:
    a = np.array([1, 2, 3, 4, 5])
    b = np.array([1, 2, 3, 2, 1])
    
    print b == 2
    print a[b == 2]
def mean_time_for_paid_students(time_spent, days_to_cancel):
    '''
    Fill in this function to calculate the mean time spent in the classroom
    for students who stayed enrolled at least (greater than or equal to) 7 days.
    Unlike in Lesson 1, you can assume that days_to_cancel will contain only
    integers (there are no students who have not canceled yet).
    
    The arguments are NumPy arrays. time_spent contains the amount of time spent
    in the classroom for each student, and days_to_cancel contains the number
    of days until each student cancel. The data is given in the same order
    in both arrays.
    '''
    return None
# Time spent in the classroom in the first week for 20 students
time_spent = np.array([
       12.89697233,    0.        ,   64.55043217,    0.        ,
       24.2315615 ,   39.991625  ,    0.        ,    0.        ,
      147.20683783,    0.        ,    0.        ,    0.        ,
       45.18261617,  157.60454283,  133.2434615 ,   52.85000767,
        0.        ,   54.9204785 ,   26.78142417,    0.
])
# Days to cancel for 20 students
days_to_cancel = np.array([
      4,   5,  37,   3,  12,   4,  35,  38,   5,  37,   3,   3,  68,
     38,  98,   2, 249,   2, 127,  35
])

sloution

1 2	def mean_time_for_paid_students(time_spent, days_to_cancel): return time_spent[days_to_cancel >=7].mean()

Quiz: + vs. +=

notice

import numpy as np
a = np.array([1,2,3,4])
b = a
a += np.array([1,1,1,])
print b

array([2,3,4,5])

import numpy as np
a = np.array([1,2,3,4])
b = a
a = a + np.array([1,1,1,])
print b

array([1,2,3,4])

Quiz: In-Place vs. Not In-Place

notice

import numpy as np
a = np.array([1,2,3,4])
slice = a[:3]
slice[0] = 100
print a

array([100,2,3,4])

Quiz: Pandas Series

quiz

import pandas as pd
countries = ['Albania', 'Algeria', 'Andorra', 'Angola', 'Antigua and Barbuda',
             'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan',
             'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
             'Belgium', 'Belize', 'Benin', 'Bhutan', 'Bolivia']
life_expectancy_values = [74.7,  75. ,  83.4,  57.6,  74.6,  75.4,  72.3,  81.5,  80.2,
                          70.3,  72.1,  76.4,  68.1,  75.2,  69.8,  79.4,  70.8,  62.7,
                          67.3,  70.6]
gdp_values = [ 1681.61390973,   2155.48523109,  21495.80508273,    562.98768478,
              13495.1274663 ,   9388.68852258,   1424.19056199,  24765.54890176,
              27036.48733192,   1945.63754911,  21721.61840978,  13373.21993972,
                483.97086804,   9783.98417323,   2253.46411147,  25034.66692293,
               3680.91642923,    366.04496652,   1175.92638695,   1132.21387981]
# Life expectancy and gdp data in 2007 for 20 countries
life_expectancy = pd.Series(life_expectancy_values)
gdp = pd.Series(gdp_values)
# Change False to True for each block of code to see what it does
# Accessing elements and slicing
if False:
    print life_expectancy[0]
    print gdp[3:6]
    
# Looping
if False:
    for country_life_expectancy in life_expectancy:
        print 'Examining life expectancy {}'.format(country_life_expectancy)
        
# Pandas functions
if False:
    print life_expectancy.mean()
    print life_expectancy.std()
    print gdp.max()
    print gdp.sum()
# Vectorized operations and index arrays
if False:
    a = pd.Series([1, 2, 3, 4])
    b = pd.Series([1, 2, 1, 2])
  
    print a + b
    print a * 2
    print a >= 3
    print a[a >= 3]
   
def variable_correlation(variable1, variable2):
    '''
    Fill in this function to calculate the number of data points for which
    the directions of variable1 and variable2 relative to the mean are the
    same, and the number of data points for which they are different.
    Direction here means whether each value is above or below its mean.
    
    You can classify cases where the value is equal to the mean for one or
    both variables however you like.
    
    Each argument will be a Pandas series.
    
    For example, if the inputs were pd.Series([1, 2, 3, 4]) and
    pd.Series([4, 5, 6, 7]), then the output would be (4, 0).
    This is because 1 and 4 are both below their means, 2 and 5 are both
    below, 3 and 6 are both above, and 4 and 7 are both above.
    
    On the other hand, if the inputs were pd.Series([1, 2, 3, 4]) and
    pd.Series([7, 6, 5, 4]), then the output would be (0, 4).
    This is because 1 is below its mean but 7 is above its mean, and
    so on.
    '''
    num_same_direction = None        # Replace this with your code
    num_different_direction = None   # Replace this with your code
    
    return (num_same_direction, num_different_direction)

solution

def variable_correlation(variable1, variable2):
    both_above = (variable1 > variable1.mean()) & \
                 (variable2 > variable2.mean())
    both_below = (variable1 < variable1.mean()) & \
                 (variable2 < variable2.mean())
    is_same_direction = both_above | both_below
    num_same_direction = is_same_direction.sum()        # Replace this with your code
    num_different_direction = len(variable1) - num_same_direction   # Replace this with your code
    
    return (num_same_direction, num_different_direction)

Quiz: Series Indexes

s.describe() s.loc[INDEX] s.iloc[0]
Pandas idxmax()
Note: The argmax() function mentioned in the videos has been realiased to idxmax(), and returns the index of the first maximally-valued element. You can find documentation for the idxmax() function in Pandas here.
quiz

import pandas as pd
countries = [
    'Afghanistan', 'Albania', 'Algeria', 'Angola',
    'Argentina', 'Armenia', 'Australia', 'Austria',
    'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
    'Barbados', 'Belarus', 'Belgium', 'Belize',
    'Benin', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina',
]
employment_values = [
    55.70000076,  51.40000153,  50.5       ,  75.69999695,
    58.40000153,  40.09999847,  61.5       ,  57.09999847,
    60.90000153,  66.59999847,  60.40000153,  68.09999847,
    66.90000153,  53.40000153,  48.59999847,  56.79999924,
    71.59999847,  58.40000153,  70.40000153,  41.20000076,
]
# Employment data in 2007 for 20 countries
employment = pd.Series(employment_values, index=countries)
def max_employment(employment):
    '''
    Fill in this function to return the name of the country
    with the highest employment in the given employment
    data, and the employment in that country.
    
    The input will be a Pandas series where the values
    are employment and the index is country names.
    
    Try using the Pandas idxmax() function. Documention can
    be found here:
    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.idxmax.html
    '''
    max_country = None      # Replace this with your code
    max_value = None   # Replace this with your code
    return (max_country, max_value)

solution

def max_employment(employment):
    max_country = employment.idxmax()     # Replace this with your code
    max_value = employment.loc[max_country]   # Replace this with your code
    return (max_country, max_value)

Quiz: Vectorized Operations and Series Indexes

quiz

import pandas as pd
# Change False to True for each block of code to see what it does
# Addition when indexes are the same
if False:
    s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
    s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
    print s1 + s2
# Indexes have same elements in a different order
if False:
    s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
    s2 = pd.Series([10, 20, 30, 40], index=['b', 'd', 'a', 'c'])
    print s1 + s2
# Indexes overlap, but do not have exactly the same elements
if False:
    s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
    s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])
    print s1 + s2
# Indexes do not overlap
if False:
    s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
    s2 = pd.Series([10, 20, 30, 40], index=['e', 'f', 'g', 'h'])
    print s1 + s2

Quiz: Filling Missing Values

Remember that Jupyter notebooks will just print out the results of the last expression run in a code cell as though a print expression was run. If you want to save the results of your operations for later, remember to assign the results to a variable or, for some Pandas functions like .dropna(), use inplace = True to modify the starting object without needing to reassign it.
quiz

import pandas as pd
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])
# Try to write code that will add the 2 previous series together,
# but treating missing values from either series as 0. The result
# when printed out should be similar to the following line:
# print pd.Series([1, 2, 13, 24, 30, 40], index=['a', 'b', 'c', 'd', 'e', 'f'])

solution

1	s1.add(s2, fill_value=0)

Quiz: Pandas Series apply()

Note: The grader will execute your finished reverse_names(names) function on some test names Series when you submit your answer. Make sure that this function returns another Series with the transformed names.

split()
You can find documentation for Python’s split() function here.
quiz

import pandas as pd
# Change False to True to see what the following block of code does
# Example pandas apply() usage (although this could have been done
# without apply() using vectorized operations)
if False:
    s = pd.Series([1, 2, 3, 4, 5])
    def add_one(x):
        return x + 1
    print s.apply(add_one)
names = pd.Series([
    'Andre Agassi',
    'Barry Bonds',
    'Christopher Columbus',
    'Daniel Defoe',
    'Emilio Estevez',
    'Fred Flintstone',
    'Greta Garbo',
    'Humbert Humbert',
    'Ivan Ilych',
    'James Joyce',
    'Keira Knightley',
    'Lois Lane',
    'Mike Myers',
    'Nick Nolte',
    'Ozzy Osbourne',
    'Pablo Picasso',
    'Quirinus Quirrell',
    'Rachael Ray',
    'Susan Sarandon',
    'Tina Turner',
    'Ugueth Urbina',
    'Vince Vaughn',
    'Woodrow Wilson',
    'Yoji Yamada',
    'Zinedine Zidane'
])
def reverse_names(names):
    '''
    Fill in this function to return a new series where each name
    in the input series has been transformed from the format
    "Firstname Lastname" to "Lastname, FirstName".
    
    Try to use the Pandas apply() function rather than a loop.
    '''
    return None

solution

def reverse_name(name):
    name = name.split(" ")
    print "name: "
    print name
    first_name = name[1]
    last_name = name[0]
    return first_name + ', ' + last_name
print "reverse_name: "
print reverse_name(names.iloc[0])
def reverse_names(names):
    '''
    Fill in this function to return a new series where each name
    in the input series has been transformed from the format
    "Firstname Lastname" to "Lastname, FirstName".
    
    Try to use the Pandas apply() function rather than a loop.
    '''
    return names.apply(reverse_name)
print "reverse: "
reverse_names(names)

Quiz: Plotting in Pandas

If the variable data is a NumPy array or a Pandas Series, just like if it is a list, the code

1 2	import matplotlib.pyplot as plt plt.hist(data)

will create a histogram of the data.

Pandas also has built-in plotting that uses matplotlib behind the scenes, so if data is a Series, you can create a histogram using data.hist().

There’s no difference between these two in this case, but sometimes the Pandas wrapper can be more convenient. For example, you can make a line plot of a series using data.plot(). The index of the Series will be used for the x-axis and the values for the y-axis.

In the following quiz, we’ve created Series containing the various variables we’ve been looking at this lesson. Pick a country you’re interested in, and make a plot of each variable over time.

The Udacity editor will only show one plot each time you click “Test Run”, so you can look at multiple plots by clicking “Test Run” multiple times. If you’re running plotting code locally, you may need to add the line plt.show() depending on your setup.
quiz

import pandas as pd
import seaborn as sns
# The following code reads all the Gapminder data into Pandas DataFrames. You'll
# learn about DataFrames next lesson.
path = '/datasets/ud170/gapminder/'
employment = pd.read_csv(path + 'employment_above_15.csv', index_col='Country')
female_completion = pd.read_csv(path + 'female_completion_rate.csv', index_col='Country')
male_completion = pd.read_csv(path + 'male_completion_rate.csv', index_col='Country')
life_expectancy = pd.read_csv(path + 'life_expectancy.csv', index_col='Country')
gdp = pd.read_csv(path + 'gdp_per_capita.csv', index_col='Country')
# The following code creates a Pandas Series for each variable for the United States.
# You can change the string 'United States' to a country of your choice.
employment_us = employment.loc['United States']
female_completion_us = female_completion.loc['United States']
male_completion_us = male_completion.loc['United States']
life_expectancy_us = life_expectancy.loc['United States']
gdp_us = gdp.loc['United States']
# Uncomment the following line of code to see the available country names
# print employment.index.values
# Use the Series defined above to create a plot of each variable over time for
# the country of your choice. You will only be able to display one plot at a time
# with each "Test Run".

solution

# employment_us.plot()
# female_completion_us.plot()
# male_completion_us.plot()
# life_expectancy_us.plot()
gdp_us.plot()

Conclusion

Numpy and Pandas for 2D Data

Introduction

Quiz: Subway Data

Quiz: Two-Dimensional NumPy Arrays

python: list of lists
numpy: 2D array
pandas: DataFrame
This page describes the memory layout of 2D NumPy arrays.
quiz

import numpy as np
# Subway ridership for 5 stations on 10 different days
ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])
# Change False to True for each block of code to see what it does
# Accessing elements
if False:
    print ridership[1, 3]
    print ridership[1:3, 3:5]
    print ridership[1, :]
    
# Vectorized operations on rows or columns
if False:
    print ridership[0, :] + ridership[1, :]
    print ridership[:, 0] + ridership[:, 1]
    
# Vectorized operations on entire arrays
if False:
    a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    b = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
    print a + b
def mean_riders_for_max_station(ridership):
    '''
    Fill in this function to find the station with the maximum riders on the
    first day, then return the mean riders per day for that station. Also
    return the mean ridership overall for comparsion.
    
    Hint: NumPy's argmax() function might be useful:
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html
    '''
    overall_mean = None # Replace this with your code
    mean_for_max = None # Replace this with your code
    
    return (overall_mean, mean_for_max)

solution

def mean_riders_for_max_station(ridership):
    overall_mean = ridership.mean() # Replace this with your code
    
    station = ridership[0,:].argmax()
    
    mean_for_max = ridership[:,station].mean() # Replace this with your code
    
    return (overall_mean, mean_for_max)

Quiz: NumPy Axis

axis = 0 column
1 row
quiz

import numpy as np
# Change False to True for this block of code to see what it does
# NumPy axis argument
if False:
    a = np.array([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ])
    
    print a.sum()
    print a.sum(axis=0)
    print a.sum(axis=1)
    
# Subway ridership for 5 stations on 10 different days
ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])
def min_and_max_riders_per_day(ridership):
    '''
    Fill in this function. First, for each subway station, calculate the
    mean ridership per day. Then, out of all the subway stations, return the
    maximum and minimum of these values. That is, find the maximum
    mean-ridership-per-day and the minimum mean-ridership-per-day for any
    subway station.
    '''
    max_daily_ridership = None     # Replace this with your code
    min_daily_ridership = None     # Replace this with your code
    
    return (max_daily_ridership, min_daily_ridership)

solution

def min_and_max_riders_per_day(ridership):
    max_daily_ridership = ridership.mean(axis=0).max()     # Replace this with your code
    min_daily_ridership = ridership.mean(axis=0).min()     # Replace this with your code
    
    return (max_daily_ridership, min_daily_ridership)

NumPy and Pandas Data types

Quiz: Accessing Elements of a DataFrame

quiz

import pandas as pd
# Subway ridership for 5 stations on 10 different days
ridership_df = pd.DataFrame(
    data=[[   0,    0,    2,    5,    0],
          [1478, 3877, 3674, 2328, 2539],
          [1613, 4088, 3991, 6461, 2691],
          [1560, 3392, 3826, 4787, 2613],
          [1608, 4802, 3932, 4477, 2705],
          [1576, 3933, 3909, 4979, 2685],
          [  95,  229,  255,  496,  201],
          [   2,    0,    1,   27,    0],
          [1438, 3785, 3589, 4174, 2215],
          [1342, 4043, 4009, 4665, 3033]],
    index=['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
           '05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11'],
    columns=['R003', 'R004', 'R005', 'R006', 'R007']
)
# Change False to True for each block of code to see what it does
# DataFrame creation
if False:
    # You can create a DataFrame out of a dictionary mapping column names to values
    df_1 = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
    print df_1
    # You can also use a list of lists or a 2D NumPy array
    df_2 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=['A', 'B', 'C'])
    print df_2
   
# Accessing elements
if False:
    print ridership_df.iloc[0]
    print ridership_df.loc['05-05-11']
    print ridership_df['R003']
    print ridership_df.iloc[1, 3]
    
# Accessing multiple rows
if False:
    print ridership_df.iloc[1:4]
    
# Accessing multiple columns
if False:
    print ridership_df[['R003', 'R005']]
    
# Pandas axis
if False:
    df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
    print df.sum()
    print df.sum(axis=1)
    print df.values.sum()
    
def mean_riders_for_max_station(ridership):
    '''
    Fill in this function to find the station with the maximum riders on the
    first day, then return the mean riders per day for that station. Also
    return the mean ridership overall for comparsion.
    
    This is the same as a previous exercise, but this time the
    input is a Pandas DataFrame rather than a 2D NumPy array.
    '''
    overall_mean = None # Replace this with your code
    mean_for_max = None # Replace this with your code
    
    return (overall_mean, mean_for_max)

solution

def mean_riders_for_max_station(ridership):
    station = ridership.iloc[0].argmax()
    overall_mean = ridership.values.mean() # Replace this with your code
    mean_for_max = ridership[station].mean() # Replace this with your code
    
    return (overall_mean, mean_for_max)

Loading Data into a DataFrame

Quiz: Calculating Correlation

Understand and Interpreting Correlations

This page contains some scatterplots of variables with different values of correlation.
This page lets you use a slider to change the correlation and see how the data might look.
Pearson’s r only measures linear correlation! This image shows some different linear and non-linear relationships and what Pearson’s r will be for those relationships.

Corrected vs. Uncorrected Standard Deviation

By default, Pandas’ std() function computes the standard deviation using Bessel’s correction. Calling std(ddof=0) ensures that Bessel’s correction will not be used.

Previous Exercise

The exercise where you used a simple heuristic to estimate correlation was the “Pandas Series” exercise in the previous lesson, “NumPy and Pandas for 1D Data”.

Pearson’s r in NumPy

NumPy’s corrcoef() function can be used to calculate Pearson’s r, also known as the correlation coefficient.
quiz

import pandas as pd
filename = '/datasets/ud170/subway/nyc_subway_weather.csv'
subway_df = pd.read_csv(filename)
def correlation(x, y):
    '''
    Fill in this function to compute the correlation between the two
    input variables. Each input is either a NumPy array or a Pandas
    Series.
    
    correlation = average of (x in standard units) times (y in standard units)
    
    Remember to pass the argument "ddof=0" to the Pandas std() function!
    '''
    return None
entries = subway_df['ENTRIESn_hourly']
cum_entries = subway_df['ENTRIESn']
rain = subway_df['meanprecipi']
temp = subway_df['meantempi']
print correlation(entries, rain)
print correlation(entries, temp)
print correlation(rain, temp)
print correlation(entries, cum_entries)

solution

def correlation(x, y):
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)
    return ( std_x * std_y ).mean()

Pandas Axis Names

1 2	axis=0 axis=1 axis='index' axis='columns'

Quiz: DataFrame Vectorized Operations

Pandas shift()

Documentation for the Pandas shift() function is here. If you’re still not sure how the function works, try it out and see!

Alternative Solution

As an alternative to using vectorized operations, you could also use the code return entries_and_exits.diff() to calculate the answer in a single step.
quiz

import pandas as pd
# Examples of vectorized operations on DataFrames:
# Change False to True for each block of code to see what it does
# Adding DataFrames with the column names
if False:
    df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
    df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]})
    print df1 + df2
    
# Adding DataFrames with overlapping column names 
if False:
    df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
    df2 = pd.DataFrame({'d': [10, 20, 30], 'c': [40, 50, 60], 'b': [70, 80, 90]})
    print df1 + df2
# Adding DataFrames with overlapping row indexes
if False:
    df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
                       index=['row1', 'row2', 'row3'])
    df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]},
                       index=['row4', 'row3', 'row2'])
    print df1 + df2
# --- Quiz ---
# Cumulative entries and exits for one station for a few hours.
entries_and_exits = pd.DataFrame({
    'ENTRIESn': [3144312, 3144335, 3144353, 3144424, 3144594,
                 3144808, 3144895, 3144905, 3144941, 3145094],
    'EXITSn': [1088151, 1088159, 1088177, 1088231, 1088275,
               1088317, 1088328, 1088331, 1088420, 1088753]
})
def get_hourly_entries_and_exits(entries_and_exits):
    '''
    Fill in this function to take a DataFrame with cumulative entries
    and exits (entries in the first column, exits in the second) and
    return a DataFrame with hourly entries and exits (entries in the
    first column, exits in the second).
    '''
    return None

solution

1 2	def get_hourly_entries_and_exits(entries_and_exits): return entries_and_exits - entries_and_exits.shift(1)

Quiz: DataFrame applymap()

Note: The grader will execute your finished convert_grades(grades) function on some test grades DataFrames when you submit your answer. Make sure that this function returns a DataFrame with the converted grades. Hint: You may need to define a helper function to use with .applymap().
quiz

import pandas as pd
# Change False to True for this block of code to see what it does
# DataFrame applymap()
if False:
    df = pd.DataFrame({
        'a': [1, 2, 3],
        'b': [10, 20, 30],
        'c': [5, 10, 15]
    })
    
    def add_one(x):
        return x + 1
        
    print df.applymap(add_one)
    
grades_df = pd.DataFrame(
    data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
          'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
    index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio', 
           'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)
    
def convert_grades(grades):
    '''
    Fill in this function to convert the given DataFrame of numerical
    grades to letter grades. Return a new DataFrame with the converted
    grade.
    
    The conversion rule is:
        90-100 -> A
        80-89  -> B
        70-79  -> C
        60-69  -> D
        0-59   -> F
    '''
    return None

solution

def convert(grades):
    if (grades >= 90) & (grades <= 100):
        grades = 'A'
    elif (grades >= 80) & (grades <= 89):
        grades = 'B'
    elif (grades >= 70) & (grades <= 79):
        grades = 'C'
    elif (grades >= 60) & (grades <= 69):
        grades = 'D'
    else:
        grades = 'F'
    return grades
    
def convert_grades(grades):
    return grades.applymap(convert)

Quiz: DataFrame apply()

Note: In order to get the proper computations, we should actually be setting the value of the “ddof” parameter to 0 in the .std() function.

Note that the type of standard deviation calculated by default is different between numpy’s .std() and pandas’ .std() functions. By default, numpy calculates a population standard deviation, with “ddof = 0”. On the other hand, pandas calculates a sample standard deviation, with “ddof = 1”. If we know all of the scores, then we have a population - so to standardize using pandas, we need to set “ddof = 0”.
.apply() used to convert columns(default) to columns and convert rows(with axis) to rows
.applymap() used to elements
quiz

import pandas as pd
grades_df = pd.DataFrame(
    data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
          'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
    index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio', 
           'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)
# Change False to True for this block of code to see what it does
# DataFrame apply()
if False:
    def convert_grades_curve(exam_grades):
        # Pandas has a bult-in function that will perform this calculation
        # This will give the bottom 0% to 10% of students the grade 'F',
        # 10% to 20% the grade 'D', and so on. You can read more about
        # the qcut() function here:
        # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
        return pd.qcut(exam_grades,
                       [0, 0.1, 0.2, 0.5, 0.8, 1],
                       labels=['F', 'D', 'C', 'B', 'A'])
        
    # qcut() operates on a list, array, or Series. This is the
    # result of running the function on a single column of the
    # DataFrame.
    print convert_grades_curve(grades_df['exam1'])
    
    # qcut() does not work on DataFrames, but we can use apply()
    # to call the function on each column separately
    print grades_df.apply(convert_grades_curve)
    
def standardize(df):
    '''
    Fill in this function to standardize each column of the given
    DataFrame. To standardize a variable, convert each value to the
    number of standard deviations it is above or below the mean.
    '''
    return None

solution

def std_col(col):
    return (col - col.mean()) / col.std(ddof=0)
    
def standardize(df):
    return df.apply(std_col)

Quiz: DataFrame apply() Use Case 2

.apply() convert columns to element
df.apply(np.max):=df.max()
quiz

import numpy as np
import pandas as pd
df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})
# Change False to True for this block of code to see what it does
# DataFrame apply() - use case 2
if False:   
    print df.apply(np.mean)
    print df.apply(np.max)
    
def second_largest(df):
    '''
    Fill in this function to return the second-largest value of each 
    column of the input DataFrame.
    '''
    return None

solution

def second_max(col):
    sorted_col = col.sort_values(ascending=False)
    return sorted_col.iloc[1]
    
def second_largest(df):
    return df.apply(second_max)

Quiz: Adding a DataFrame to a Series

code

import pandas as pd
# Change False to True for each block of code to see what it does
# Adding a Series to a square DataFrame
if False:
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({
        0: [10, 20, 30, 40],
        1: [50, 60, 70, 80],
        2: [90, 100, 110, 120],
        3: [130, 140, 150, 160]
    })
    
    print df
    print '' # Create a blank line between outputs
    print df + s
    
# Adding a Series to a one-row DataFrame 
if False:
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({0: [10], 1: [20], 2: [30], 3: [40]})
    
    print df
    print '' # Create a blank line between outputs
    print df + s
# Adding a Series to a one-column DataFrame
if False:
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({0: [10, 20, 30, 40]})
    
    print df
    print '' # Create a blank line between outputs
    print df + s
    
    
# Adding when DataFrame column names match Series index
if False:
    s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
    df = pd.DataFrame({
        'a': [10, 20, 30, 40],
        'b': [50, 60, 70, 80],
        'c': [90, 100, 110, 120],
        'd': [130, 140, 150, 160]
    })
    
    print df
    print '' # Create a blank line between outputs
    print df + s
    
# Adding when DataFrame column names don't match Series index
if False:
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({
        'a': [10, 20, 30, 40],
        'b': [50, 60, 70, 80],
        'c': [90, 100, 110, 120],
        'd': [130, 140, 150, 160]
    })
    
    print df
    print '' # Create a blank line between outputs
    print df + s

Quiz: Standardizing Each Column Again

quiz

import pandas as pd
# Adding using +
if False:
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({
        0: [10, 20, 30, 40],
        1: [50, 60, 70, 80],
        2: [90, 100, 110, 120],
        3: [130, 140, 150, 160]
    })
    
    print df
    print '' # Create a blank line between outputs
    print df + s
    
# Adding with axis='index'
if False:
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({
        0: [10, 20, 30, 40],
        1: [50, 60, 70, 80],
        2: [90, 100, 110, 120],
        3: [130, 140, 150, 160]
    })
    
    print df
    print '' # Create a blank line between outputs
    print df.add(s, axis='index')
    # The functions sub(), mul(), and div() work similarly to add()
    
# Adding with axis='columns'
if False:
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({
        0: [10, 20, 30, 40],
        1: [50, 60, 70, 80],
        2: [90, 100, 110, 120],
        3: [130, 140, 150, 160]
    })
    
    print df
    print '' # Create a blank line between outputs
    print df.add(s, axis='columns')
    # The functions sub(), mul(), and div() work similarly to add()
    
grades_df = pd.DataFrame(
    data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
          'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
    index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio', 
           'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)
def standardize(df):
    '''
    Fill in this function to standardize each column of the given
    DataFrame. To standardize a variable, convert each value to the
    number of standard deviations it is above or below the mean.
    
    This time, try to use vectorized operations instead of apply().
    You should get the same results as you did before.
    '''
    return None
def standardize_rows(df):
    '''
    Optional: Fill in this function to standardize each row of the given
    DataFrame. Again, try not to use apply().
    
    This one is more challenging than standardizing each column!
    '''
    return None

solution

def standardize(df):
    return (df - df.mean()) / df.std(ddof=0)
def standardize_rows(df):
    mean_diff = df.sub(df.mean(axis='columns'), axis='index')
    return mean_diff.div(df.std(ddof=0, axis='columns'), axis='index')

Quiz: Pandas groupby()

code

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
values = np.array([1, 3, 2, 4, 1, 6, 4])
example_df = pd.DataFrame({
    'value': values,
    'even': values % 2 == 0,
    'above_three': values > 3 
}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# Change False to True for each block of code to see what it does
# Examine DataFrame
if False:
    print example_df
    
# Examine groups
if False:
    grouped_data = example_df.groupby('even')
    # The groups attribute is a dictionary mapping keys to lists of row indexes
    print grouped_data.groups
    
# Group by multiple columns
if False:
    grouped_data = example_df.groupby(['even', 'above_three'])
    print grouped_data.groups
    
# Get sum of each group
if False:
    grouped_data = example_df.groupby('even')
    print grouped_data.sum()
    
# Limit columns in result
if False:
    grouped_data = example_df.groupby('even')
    
    # You can take one or more columns from the result DataFrame
    print grouped_data.sum()['value']
    
    print '\n' # Blank line to separate results
    
    # You can also take a subset of columns from the grouped data before 
    # collapsing to a DataFrame. In this case, the result is the same.
    print grouped_data['value'].sum()
    
filename = '/datasets/ud170/subway/nyc_subway_weather.csv'
subway_df = pd.read_csv(filename)
### Write code here to group the subway data by a variable of your choice, then
### either print out the mean ridership within each group or create a plot.

Quiz: Calculating Hourly Entries and Exits

In the quiz where you calculated hourly entries and exits, you did so for a single set of cumulative entries. However, in the original data, there was a separate set of numbers for each station.

Thus, to correctly calculate the hourly entries and exits, it was necessary to group by station and day, then calculate the hourly entries and exits within each day.

Write a function to do that. You should use the apply() function to call the function you wrote previously. You should also make sure you restrict your grouped data to just the entries and exits columns, since your function may cause an error if it is called on non-numerical data types.

If you would like to learn more about using groupby() in Pandas, this page contains more details.

Note: You will not be able to reproduce the ENTRIESn_hourly and EXITSn_hourly columns in the full dataset using this method. When creating the dataset, we did extra processing to remove erroneous values.

quiz

To clarify the structure of the data, the original data recorded the cumulative number of entries on each station at four-hour intervals. For the quiz, you just need to look at the differences between consecutive measurements on each station: by computing “hourly entries”, we just mean recording the number of new tallies between each recording period as a contrast to “cumulative entries”.

import numpy as np
import pandas as pd
values = np.array([1, 3, 2, 4, 1, 6, 4])
example_df = pd.DataFrame({
    'value': values,
    'even': values % 2 == 0,
    'above_three': values > 3 
}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# Change False to True for each block of code to see what it does
# Standardize each group
if False:
    def standardize(xs):
        return (xs - xs.mean()) / xs.std()
    grouped_data = example_df.groupby('even')
    print grouped_data['value'].apply(standardize)
    
# Find second largest value in each group
if False:
    def second_largest(xs):
        sorted_xs = xs.sort(inplace=False, ascending=False)
        return sorted_xs.iloc[1]
    grouped_data = example_df.groupby('even')
    print grouped_data['value'].apply(second_largest)
# --- Quiz ---
# DataFrame with cumulative entries and exits for multiple stations
ridership_df = pd.DataFrame({
    'UNIT': ['R051', 'R079', 'R051', 'R079', 'R051', 'R079', 'R051', 'R079', 'R051'],
    'TIMEn': ['00:00:00', '02:00:00', '04:00:00', '06:00:00', '08:00:00', '10:00:00', '12:00:00', '14:00:00', '16:00:00'],
    'ENTRIESn': [3144312, 8936644, 3144335, 8936658, 3144353, 8936687, 3144424, 8936819, 3144594],
    'EXITSn': [1088151, 13755385,  1088159, 13755393,  1088177, 13755598, 1088231, 13756191,  1088275]
})
def get_hourly_entries_and_exits(entries_and_exits):
    '''
    Fill in this function to take a DataFrame with cumulative entries
    and exits and return a DataFrame with hourly entries and exits.
    The hourly entries and exits should be calculated separately for
    each station (the 'UNIT' column).
    
    Hint: Take a look at the `get_hourly_entries_and_exits()` function
    you wrote in a previous quiz, DataFrame Vectorized Operations. If
    you copy it here and rename it, you can use it and the `.apply()`
    function to help solve this problem.
    '''
    return None

solution

def hourly_entries_and_exits(entries_and_exits):
    return entries_and_exits - entries_and_exits.shift(1)
    
def get_hourly_entries_and_exits(entries_and_exits):
    grouped_data = entries_and_exits.groupby('UNIT')
    return grouped_data[['ENTRIESn', 'EXITSn']].apply(hourly_entries_and_exits)

Quiz: Combining Pandas DataFrames

In the merged table on the right, the join dates in the third and fourth rows should be 5/19 and 5/11, reflecting the account key mapping in the enrollments table.
quiz

import pandas as pd
subway_df = pd.DataFrame({
    'UNIT': ['R003', 'R003', 'R003', 'R003', 'R003', 'R004', 'R004', 'R004',
             'R004', 'R004'],
    'DATEn': ['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
              '05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11'],
    'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'ENTRIESn': [ 4388333,  4388348,  4389885,  4391507,  4393043, 14656120,
                 14656174, 14660126, 14664247, 14668301],
    'EXITSn': [ 2911002,  2911036,  2912127,  2913223,  2914284, 14451774,
               14451851, 14454734, 14457780, 14460818],
    'latitude': [ 40.689945,  40.689945,  40.689945,  40.689945,  40.689945,
                  40.69132 ,  40.69132 ,  40.69132 ,  40.69132 ,  40.69132 ],
    'longitude': [-73.872564, -73.872564, -73.872564, -73.872564, -73.872564,
                  -73.867135, -73.867135, -73.867135, -73.867135, -73.867135]
})
weather_df = pd.DataFrame({
    'DATEn': ['05-01-11', '05-01-11', '05-02-11', '05-02-11', '05-03-11',
              '05-03-11', '05-04-11', '05-04-11', '05-05-11', '05-05-11'],
    'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'latitude': [ 40.689945,  40.69132 ,  40.689945,  40.69132 ,  40.689945,
                  40.69132 ,  40.689945,  40.69132 ,  40.689945,  40.69132 ],
    'longitude': [-73.872564, -73.867135, -73.872564, -73.867135, -73.872564,
                  -73.867135, -73.872564, -73.867135, -73.872564, -73.867135],
    'pressurei': [ 30.24,  30.24,  30.32,  30.32,  30.14,  30.14,  29.98,  29.98,
                   30.01,  30.01],
    'fog': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'rain': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'tempi': [ 52. ,  52. ,  48.9,  48.9,  54. ,  54. ,  57.2,  57.2,  48.9,  48.9],
    'wspdi': [  8.1,   8.1,   6.9,   6.9,   3.5,   3.5,  15. ,  15. ,  15. ,  15. ]
})
def combine_dfs(subway_df, weather_df):
    '''
    Fill in this function to take 2 DataFrames, one with subway data and one with weather data,
    and return a single dataframe with one row for each date, hour, and location. Only include
    times and locations that have both subway data and weather data available.
    '''
    return None

solution

1 2	def combine_dfs(subway_df, weather_df): return subway_df.merge(weather_df, on=['DATEn', 'hour', 'latitude', 'longitude'], how='inner')

Quiz: Plotting for DataFrames

Just like Pandas Series, DataFrames also have a plot() method. If df is a DataFrame, then df.plot() will produce a line plot with a different colored line for each variable in the DataFrame. This can be a convenient way to get a quick look at your data, especially for small DataFrames, but for more complicated plots you will usually want to use matplotlib directly.

In the following quiz, create a plot of your choice showing something interesting about the New York subway data. For example, you might create:

Histograms of subway ridership on both days with rain and days without rain
A scatterplot of subway stations with latitude and longitude as the x and y axes and ridership as the bubble size
If you choose this option, you may wish to use the as_index=False argument to groupby(). There is example code in the following quiz.
A scatterplot with subway ridership on one axis and precipitation or temperature on the other
If you’re not sure how to make the plot you want, try searching on Google or take a look at the matplotlib documentation. Once you’ve created a plot you’re happy with, share what you’ve found on the forums!
quiz

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
values = np.array([1, 3, 2, 4, 1, 6, 4])
example_df = pd.DataFrame({
    'value': values,
    'even': values % 2 == 0,
    'above_three': values > 3 
}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# Change False to True for this block of code to see what it does
# groupby() without as_index
if False:
    first_even = example_df.groupby('even').first()
    print first_even
    print first_even['even'] # Causes an error. 'even' is no longer a column in the DataFrame
    
# groupby() with as_index=False
if False:
    first_even = example_df.groupby('even', as_index=False).first()
    print first_even
    print first_even['even'] # Now 'even' is still a column in the DataFrame
filename = '/datasets/ud170/subway/nyc_subway_weather.csv'
subway_df = pd.read_csv(filename)
## Make a plot of your choice here showing something interesting about the subway data.
## Matplotlib documentation here: http://matplotlib.org/api/pyplot_api.html
## Once you've got something you're happy with, share it on the forums!

solution

1
2
3

location = subway_df.groupby(['latitude', 'longitude'], as_index=False).mean()
scale = location['ENTRIESn_hourly'] / location['ENTRIESn_hourly'].std(ddof=0)
plt.scatter(location['latitude'], location['longitude'], s=scale)

Three-Dimensional Data

Now that you’ve worked with one-dimensional and two-dimensional data, you might be wondering how to work with three or more dimensions.

3D data in NumPy

NumPy arrays can have arbitrarily many dimensions. Just like you can create a 1D array from a list, and a 2D array from a list of lists, you can create a 3D array from a list of lists of lists, and so on. For example, the following code would create a 3D array:

a = np.array([
    [['A1a', 'A1b', 'A1c'], ['A2a', 'A2b', 'A2c']],
    [['B1a', 'B1b', 'B1c'], ['B2a', 'B2b', 'B2c']]
])

3D data in Pandas

Pandas has a data structure called a Panel, which is similar to a DataFrame or a Series, but for 3D data. If you would like, you can learn more about Panels here.

Conclusion

Final Project: Investigate a Dataset

Data Analysis Process

Setting Up Your System

Intro to CSVs

CSVs in Python

Python’s csv Module

Iterators in Python

Solutions

Fixing Data Types

Questions about Student Data

Investigating the Data

Problems in the Data

Removing an Element from a Dictionary

Solutions

Updated Code for Previous Exercise

Missing Engagement Records

Checking for More Problem Records

Tracking Down the Remaining Problems

Refining the Question

Exploratory Data Analysis

Solutions

Getting Data from First Week

Indulge Curiosity

Exploring Student Engagement

Debugging Data Analysis Code

Lessons Completed in First Week

Number of Visits in the First Week

Splitting out Passing Students

Quiz: Comparing the Two Student Groups

Quiz: Making Histograms

Visualizing data

Making histograms in Python

Making histograms of student data

Are your Results Just Noise?

Statistics

Correlation Does Not Imply Causation

Cheese and Bedsheet Tangling

A/B Testing

Predicting Based on Many Features

Machine Learning

Communication

Quiz: Improving Plots and Sharing Findings

Adding labels and titles

Making plots look nicer with seaborn

Adding extra arguments to your plot

Improving one of your plots

Sharing your findings

Data Analysis and Related Terms

Conclusion

Quiz Solutions

CSVs in Python

Investigating the Data

Problems in the Data

Missing engagement records

Checking for more problem records

Refining the Question

Getting Data from First Week

Debugging Data Analysis Code

Fixing Bug in within_one_week()

Lessons Completed in First Week

Number of Visits in the First Week

Splitting out Passing Students

Comparing the Two Student Groups

Making Histograms

Fixing the Number of Bins

Improving Plots and Sharing Findings

Quiz: Survey Says!

Numpy and Pandas for 1D Data

Introduction

Quiz: Gapminder Data

One-Dimensional Data in NumPy and Pandas

Quiz: NumPy Arrays

Quiz: Vectorized Operations

Quiz: Multiplying by a Scalar

Quiz: Calculate Overall Completion Rate

Quiz: Standardizing Data

Quiz: NumPy Index Arrays

Quiz: + vs. +=

Quiz: In-Place vs. Not In-Place

Quiz: Pandas Series

Quiz: Series Indexes